Topic Segmentation of Web Documents with Automatic Cue Phrase Identification and BLSTM-CNN
Abstract
Topic segmentation plays an important role in discourse analysis and document understanding. Previous work mainly focuses on unsupervised methods for topic segmentation. In this paper, we propose to use a bidirectional long short-term memory (BLSTM) model, along with a convolutional neural network (CNN), for learning paragraph representations. In addition, we present a novel algorithm based on frequent subsequence mining to automatically discover high-quality cue phrases from documents. Experiments show that our proposed model achieves much better performance than strong baselines, and that our mined cue phrases are reasonable and effective. This is also the first work that investigates the task of topic segmentation for web documents.
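The abstract names frequent subsequence mining as the basis for cue phrase discovery but gives no algorithmic details. The following is only a rough sketch of the general idea, not the paper's algorithm: it assumes that cue phrases are contiguous word sequences that open many paragraphs, counts each paragraph-initial word sequence once per paragraph, and keeps the longest sequences that reach a support threshold. The function name `mine_cue_phrases` and the parameters `max_len` and `min_support` are hypothetical.

```python
import re
from collections import Counter


def mine_cue_phrases(paragraphs, max_len=3, min_support=2):
    """Return (phrase, support) pairs for frequent paragraph-initial word sequences.

    Assumption for illustration only: a cue phrase is a contiguous word
    sequence at the start of a paragraph. The paper's actual mining
    procedure is not described in the abstract.
    """
    support = Counter()
    for text in paragraphs:
        words = re.findall(r"[a-z']+", text.lower())
        # Each paragraph contributes every leading n-gram exactly once.
        for n in range(1, min(max_len, len(words)) + 1):
            support[tuple(words[:n])] += 1

    # Keep only sequences that open at least `min_support` paragraphs.
    frequent = {p: c for p, c in support.items() if c >= min_support}

    # Drop a sequence when a one-word-longer frequent sequence extends it
    # with the same support (the longer one is the more specific phrase).
    closed = [
        (" ".join(p), c)
        for p, c in frequent.items()
        if not any(
            len(q) == len(p) + 1 and q[: len(p)] == p and frequent[q] == c
            for q in frequent
        )
    ]
    # Sort by descending support, then alphabetically for stable output.
    return sorted(closed, key=lambda pc: (-pc[1], pc[0]))
```

Calling `mine_cue_phrases` on a list of paragraph strings returns the surviving phrases with their paragraph counts. A real system would likely also allow gaps between words and filter out phrases that are merely common sentence openers.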
Similar resources
A hierarchical Convolutional Neural Network for Segmentation of Stroke Lesion in 3D Brain MRI
Introduction: Brain tumors such as glioma are among the most aggressive lesions, which result in a very short life expectancy in patients. Image segmentation is highly essential in medical image analysis with applications, particularly in clinical practices to treat brain tumors. Accurate segmentation of magnetic resonance data is crucial for diagnostic purposes, planning surgical treatments, a...
Ontology based Web Page Topic Identification
With the emergence of the web, much research effort has gone into the area of Web Mining. This paper proposes an approach for automatic topic identification from web pages. The contribution of this research is an approach to automatic topic identification of web pages that can provide better results. The topic of a web document is identified through an ontological approach.
Discovering Topic Boundaries for Text Summarization Based on Word Co-occurrence
Topic Segmentation is the task of breaking documents into topically coherent multiparagraph subparts. In particular, Topic Segmentation is extensively used in Text Summarization to provide more coherent results by taking into account raw document structure. However, most methodologies are based on lexical repetition that show evident reliability problems or rely on harvesting linguistic resourc...
Automatic title generation for Chinese spoken documents using an adaptive k nearest-neighbor approach
The purpose of automatic title generation is to understand a document and to summarize it with only several but readable words or phrases. It is important for browsing and retrieving spoken documents, which may be automatically transcribed, but it will be much more helpful if given the titles indicating the content subjects of the documents. For title generation for Chinese language, additional...
Publication date: 2016